The Lexer Hack
   HOME

TheInfoList



OR:

In
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
, the lexer hack is a common solution to the problems in
parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
ANSI C ANSI C, ISO C, and Standard C are successive standards for the C programming language published by the American National Standards Institute (ANSI) and ISO/IEC JTC 1/SC 22/WG 14 of the International Organization for Standardization (ISO) and th ...
, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.


Problem

The problem is that in the following code, the lexical class of A cannot be determined without further contextual information: (A)*B This code could be multiplication of two variables, in which case A is a variable; written unambiguously: A * B Alternatively, it could be casting the dereferenced value of B to the type A, in which case A is a typedef name; written unambiguously: (A) (*B) In more detail, in a
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
, the lexer performs one of the earliest stages of converting the
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the wo ...
to a program. It scans the text to extract meaningful ''tokens'', such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule. This ambiguity can happen in C if the lexer does not distinguish between variable and
typedef typedef is a reserved keyword in the programming languages C, C++, and Objective-C. It is used to create an additional name (''alias'') for another data type, but does not create a new type, except in the obscure case of a qualified typedef of ...
identifiers. For example, in the C expression: (A) * B the lexer may find these tokens: # left parenthesis # identifier 'A' # right parenthesis # operator '*' # identifier 'B' The problem is precisely that the lexical class of ''A'' cannot be determined without further context: the parser can interpret this as variable ''A'' multiplied by ''B'' or as type ''A'' casting the dereferenced value of ''B''. This is known as the "typedef-name: identifier" problem, due to the name of the problematic production rule.


The hack solution

The solution generally consists of feeding information from the semantic
symbol table In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier (or symbols), constants, procedures and functions in a program's source code is associated with info ...
back into the lexer. That is, rather than functioning as a pure one-way
pipeline Pipeline may refer to: Electronics, computers and computing * Pipeline (computing), a chain of data-processing stages or a CPU optimization found on ** Instruction pipelining, a technique for implementing instruction-level parallelism within a s ...
from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a "
hack Hack may refer to: Arts, entertainment, and media Games * ''Hack'' (Unix video game), a 1984 roguelike video game * ''.hack'' (video game series), a series of video games by the multimedia franchise ''.hack'' Music * ''Hack'' (album), a 199 ...
". Without added context, the lexer cannot distinguish type identifiers from other identifiers because all identifiers have the same format. With the hack in the above example, when the lexer finds the identifier ''A'' it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears. The problem also exists in
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
and parsers can use the same hack.


Alternative solutions

This problem does not arise (and hence needs no "hack" in order to solve) when using
lexerless parsing In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeli ...
techniques, as these are intrinsically contextual. These are generally seen as less elegant designs, however, because they lack the modularity of having a
concurrent Concurrent means happening at the same time. Concurrency, concurrent, or concurrence may refer to: Law * Concurrence, in jurisprudence, the need to prove both ''actus reus'' and ''mens rea'' * Concurring opinion (also called a "concurrence"), a ...
lexer and parser in a pipeline. Some parser generators, such as the byacc-derived BtYacc ("Backtracking Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules. The
Clang Clang is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages, as well as the OpenMP, OpenCL, RenderScript, CUDA, and HIP frameworks. It acts as a drop-in replacement for the GNU Compiler Collection (GCC), ...
parser handles the situation in a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's semantic analysis library to determine the nature of the identifier. This allows a much cleaner
separation of concerns In computer science, separation of concerns is a design principle for separating a computer program into distinct sections. Each section addresses a separate '' concern'', a set of information that affects the code of a computer program. A concern ...
and encapsulation of the lexer and parser, and is therefore considered by some compiler developers to be a much more elegant solution than The Lexer Hack. This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.


See also

*
Dangling else The dangling else is a problem in programming of parser generators in which an optional else clause in an if–then(–else) statement results in nested conditionals being ambiguous. Formally, the reference context-free grammar of the language i ...


References


Citations

* http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html * http://cs.nyu.edu/rgrimm/papers/pldi06.pdf * http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html
DOI.org


* http://groups.google.com/group/comp.compilers/browse_frm/thread/db7f68e9d8b49002/fa20bf5de9c73472?lnk=st&q=%2B%22the+lexer+hack%22&rnum=1&hl=en#fa20bf5de9c73472 {{DEFAULTSORT:Lexer hack, The C (programming language) C++ Parsing Articles with example C code